Systems and methods for sequential event prediction with noise-contrastive estimation for marked temporal point process

ABSTRACT

Embodiments for systems and methods of sequential event prediction with noise-contrastive estimation for marked temporal point process are disclosed.

CROSS REFERENCE TO RELATED APPLICATIONS

This is a U.S. non-provisional patent application that claims benefit to U.S. provisional patent application Ser. No. 62/697,880 filed on Jul. 13, 2018, which is incorporated by reference in its entirety.

FIELD

The present disclosure generally relates to marked temporal point processes, and in particular to systems and methods for noise-contrastive estimation of marked temporal point processes.

BACKGROUND

Recent years have witnessed a boom in sequential event data in a variety of high-impact domains, ranging from the streams of reposts on microblogging platforms to the usage records of a bike in bike sharing programs. More often than not, such data carries two sources of valuable information: the type (a.k.a. feature or mark) and the timing of the event. For example, as shown in FIG. 1A, given a sequence of retweets for the tweet “AI is the new electricity!” on Twitter, the event type refers to the category of the user who retweets, which can be either a celebrity or an average person, and the timing of the event refers to the detailed retweet timestamp in the timeline (e.g., t₀, t₁ . . . ). With the wide availability of sequential event data, one natural question to ask is: “given observed sequential events, can the exact timestamp of a particular event in the near future be predicted?” As in the case of the bike usage sequence in FIG. 1B, one may want to know when a given bike will next arrive, and at which bike station. This prediction task has significant implications in advancing a variety of real-world applications such as patient treatment suggestion, predictive maintenance and trending tweet prediction. To model such sequential data for downstream predictive tasks, a class of mathematical models called the marked temporal point process (MTPP) is often exploited. Given the time and feature of an event, these models use the conditional intensity function (CIF) to jointly estimate how likely the event is to happen in the near future.

However, existing efforts are overwhelmingly devoted to the parameterization of MTPP models. Conventional studies on MTPP models are heavily focused on the design of the CIF to effectively model both the event feature and the timing information. Recently, there has been a surge of research in developing RNN-based MTPP models, which aim to enhance the predictive power of MTPP by learning representations for event sequences. Despite the empirical success of this research, little attention has been paid to the training process of MTPP models. The vast majority of existing work leverages Maximum Likelihood Estimation (MLE) to train MTPP models. However, the likelihood function of an MTPP model is often difficult to estimate because it has to be normalized by a definite integral of the CIF, which can be intractable to compute, especially for neural MTPP models. To alleviate this issue, existing approaches either: (1) limit the CIF to integrable functions; or (2) approximate the likelihood with Monte Carlo sampling. Nonetheless, the first approach leads to a suboptimal specification of the CIF, and the second requires the marginal distribution of event time to be assumed a priori. This is in addition to other problems of MLE such as mode dropping, which stems from the fact that MLE attempts to minimize the asymmetric KL divergence between the data distribution and the generative model. These issues inherently limit the usage of MLE for MTPP models in practice.

It is with these observations in mind, among others, that various aspects of the present disclosure were conceived and developed.

BRIEF DESCRIPTION OF THE DRAWINGS

The present patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.

FIG. 1A is an illustration of a real-world example of sequential event data for retweet behaviors of a tweet and FIG. 1B is an illustration of a real-world example of sequential event data for usage patterns of a bike;

FIG. 2 is a simplified block diagram of an overview of a framework according to the present disclosure;

FIGS. 3A-3F are graphical representations of time prediction results of three datasets with error bars representing one standard deviation;

FIGS. 4A-4F are graphical representations of mark prediction results of three datasets with error bars representing one standard deviation;

FIG. 5 is an example schematic diagram of a computing system that may implement various methodologies of the proposed marked temporal point process framework; and

FIG. 6 is a simplified block diagram illustrating an exemplary network/system embodiment of the MTPP framework for training models with adaptive noise sample generation as described herein.

Corresponding reference characters indicate corresponding elements among the views of the drawings. The headings used in the figures do not limit the scope of the claims.

DETAILED DESCRIPTION

Recently, various recurrent neural network (RNN) models have been proposed to enhance the predictive power of marked temporal point processes. However, existing marked temporal point process models are fundamentally based on the Maximum Likelihood Estimation (MLE) framework for training, and inevitably suffer from the problems resulting from the intractable likelihood function. The present disclosure provides a technical solution to the aforementioned technical difficulties, in the form of a machine learning training framework based on noise-contrastive estimation. In one aspect, the novel training framework may be implemented to resolve the issue of the intractable likelihood function in the training of MTPP models.

Preliminaries

The background knowledge to facilitate the understanding of the MTPP models will now be discussed. First, the basic concepts of MTPP will be introduced and the notations noted herein summarized. Afterwards, the existing MLE training frameworks for marked temporal point process will be discussed.

Marked Temporal Point Process

The concepts of the marked temporal point process (MTPP) will be explained using the retweet sequence example (FIG. 1A) for a better understanding. A sequence of M retweets $\tau = \{(t_1, \chi_1), \ldots, (t_i, \chi_i), \ldots, (t_M, \chi_M)\}$ is an instance of a marked temporal point process, where $(t_1, \ldots, t_M) \in \mathbb{R}_{\geq 0}^{M}$ refers to a strictly ascending sequence of timestamps, and $(\chi_1, \ldots, \chi_i, \ldots, \chi_M)$ is the corresponding sequence of d-dimensional marks (e.g., features of retweeters). The symbol $\mathcal{X}^{j}$ denotes the domain of definition for $\chi_i^{j}$, i.e., the j-th dimension of mark $\chi_i$ (∀i=1, . . . , M; j=1, . . . , d). Without loss of generality, the mark of an event is treated as a continuous multidimensional variable χ. Discrete marks are covered as a special case: in a retweet sequence (FIG. 1A), the mark of an event indicates whether the retweeter is a celebrity or not, which naturally makes it a discrete and unidimensional variable. The notation τ=(t, χ) denotes the random variables of an event, and τ with a subscript, $\tau_i$, denotes the i-th event $(t_i, \chi_i)$.
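The notation above can be made concrete with a small data structure. The following is a minimal sketch, not part of the disclosed framework: the names `Event`, `is_valid_sequence` and `history` are hypothetical, and encoding a celebrity retweeter as the mark value 1.0 is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Event:
    """One marked event tau_i = (t_i, chi_i)."""
    t: float               # timestamp t_i >= 0
    mark: Sequence[float]  # d-dimensional mark chi_i


def is_valid_sequence(events: List[Event]) -> bool:
    """Check non-negative, strictly ascending timestamps and a common mark dimension d."""
    if not events:
        return True
    if events[0].t < 0:
        return False
    for prev, cur in zip(events, events[1:]):
        if cur.t <= prev.t:
            return False
    d = len(events[0].mark)
    return all(len(e.mark) == d for e in events)


def history(events: List[Event], i: int) -> List[Event]:
    """Return H_i, the first i events of the sequence (i is 1-indexed as in the text)."""
    return events[:i]


# Hypothetical retweet sequence: mark 1.0 = celebrity, 0.0 = average user (assumed encoding).
retweets = [Event(0.5, [1.0]), Event(1.2, [0.0]), Event(3.0, [0.0])]
h2 = history(retweets, 2)
```

Here `h2` plays the role of the history up to the second event, which is what the conditional intensity function is conditioned on.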

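Given the history of a sequence until $t_i$, i.e., $\mathcal{H}_i = \{(t_1, \chi_1), \ldots, (t_i, \chi_i)\}$, the process can be characterized by the conditional intensity function (CIF) as follows:

$\begin{matrix} {{\lambda\left( \tau \middle| \mathcal{H}_{i} \right)} = {\mathbb{E}\left\lbrack {{N\left( {t + dt},\chi \middle| \mathcal{H}_{i} \right)} - {N\left( t,\chi \middle| \mathcal{H}_{i} \right)}} \right\rbrack},} & (1) \end{matrix}$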

wherein dt is an infinitesimal interval around t and N(t, χ) indicates the number of events (e.g., retweets) with mark χ (e.g., user feature) in the sequence until t. For example, in FIG. 1A, the CIF in Eq. (1) evaluates how likely the next retweet would be posted at timestamp t by a user with the feature χ by using the conditional intensity, which is a continuous un-normalized scalar. Using the chain rule, the CIF can be decomposed into two parts such that λ(τ)=p(χ|t)λ(t), where p(χ|t) is the conditional probability of the mark χ conditioned on the timestamp t. By setting p(χ|t)=1 (χ can only take one value), a typical unmarked temporal point process (TPP) can be used to model the conditional intensity λ(t), and two of the most popular models are expressed as follows:

A homogeneous Poisson Process comes with the assumption that inter-event time intervals are i.i.d. samples from an exponential distribution. Thus, the CIF is a constant

${{\lambda (t)} = \frac{\mathbb{E}\left\lbrack {\overset{\sim}{N}(t)} \right\rbrack}{t}},$

where $\overset{\sim}{N}(t)$ counts the events up to time t.

A Hawkes Process has the following formulation of the CIF: $\lambda(t | \mathcal{H}_i) = \mu_0 + \alpha \sum_{j=1}^{i} \phi(t, t_j)$, where $\phi(t, t_j) \geq 0$ denotes the self-exciting kernel and $\mu_0 \in \mathbb{R}$ is a parameter.
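The two unmarked models can be illustrated numerically. The sketch below is an illustration under stated assumptions rather than the disclosed implementation: the function names are hypothetical, and the exponential kernel ϕ(t, t_j) = exp(−decay·(t − t_j)) is one common choice that the text does not prescribe (it only requires ϕ ≥ 0).

```python
import math
import random


def simulate_homogeneous_poisson(rate, horizon, rng):
    """Draw event times on [0, horizon] using i.i.d. exponential inter-event gaps."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rate)
        if t > horizon:
            return times
        times.append(t)


def hawkes_intensity(t, past_times, mu0, alpha, decay):
    """lambda(t | H_i) = mu0 + alpha * sum_j phi(t, t_j).

    Assumption: exponential kernel phi(t, t_j) = exp(-decay * (t - t_j)).
    """
    return mu0 + alpha * sum(math.exp(-decay * (t - tj)) for tj in past_times if tj < t)


rng = random.Random(0)
events = simulate_homogeneous_poisson(rate=2.0, horizon=5000.0, rng=rng)
estimated_rate = len(events) / 5000.0  # N(t) / t, an estimate of the constant CIF
```

With a long horizon, `estimated_rate` should lie close to the true rate of 2.0, which reflects the constant CIF of the homogeneous Poisson process. For the Hawkes sketch, each past event raises the intensity above `mu0`, and the contribution decays as time passes.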

Maximum Likelihood Estimation for Marked Temporal Point Process

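With the likelihood function defined, MLE is the most widely used estimator for TPP models. In particular, given the history sequence $\mathcal{H}_{i-1}$, the likelihood of observing the i-th event $\tau_i = (t_i, \chi_i)$, $t_i > t_{i-1}$, with the CIF $\lambda_\theta$ can be formulated as:

$\begin{matrix} {{p_{\theta}\left( \tau_{i} \middle| \mathcal{H}_{i - 1} \right)} = {{\lambda_{\theta}\left( \tau_{i} \right)}{\exp\left( {- {\int_{\chi \in \mathcal{X}}{\int_{t_{i - 1}}^{t_{i}}{{\lambda_{\theta}(\tau)}{dt}\,{d\chi}}}}} \right)}}.} & (2) \end{matrix}$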

Thus, the log likelihood function of observing a sequence of N events τ=(τ₁, τ₂, . . . , τ_(N)) at time t_(N) can be written as:

$\begin{matrix} {{{\log \; {p_{\theta}(\tau)}} = {\sum\limits_{i = 1}^{N}\left\lbrack {{\log \; {\lambda_{\theta}\left( \tau_{i} \right)}} - {\int_{\chi \in \mathcal{X}}{\int_{t_{i - 1}}^{t_{i}}{{\lambda_{\theta}(\tau)}{dt}\,{d\chi}}}}} \right\rbrack}},} & (3) \end{matrix}$

where t₀=0. By maximizing the above log likelihood, the estimated model parameters θ may be obtained. However, the normalizer $\int_{\chi \in \mathcal{X}} \int_{t_{i-1}}^{t_i} \lambda_\theta(\tau)\, dt\, d\chi$ is a definite integral of the CIF and can often be infeasible to compute, especially when neural networks are used to parameterize the CIF.

Although approximation methods such as Monte Carlo sampling can be applied to compute the normalizer and its gradients, strong assumptions have to be made. For example, it has been conventionally assumed that the events of each sequence are uniformly distributed along the continuous time space. However, such assumptions may not always hold on real-world sequential event data. Hence, this provides motivation and a technical need to develop a novel training framework for complex MTPP models.
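To make the Monte Carlo approach concrete, the sketch below estimates the log likelihood of Eq. (3) for an unmarked process, so the integral over marks is omitted. It is a simplified illustration: the function names are hypothetical, time points are drawn uniformly within each inter-event interval, and the toy linear intensity is chosen only because its integral is known in closed form for comparison.

```python
import math
import random


def mc_log_likelihood(event_times, intensity, samples_per_interval, rng):
    """Monte Carlo estimate of Eq. (3) for an unmarked process."""
    log_lik, prev = 0.0, 0.0  # t_0 = 0
    for i, t in enumerate(event_times):
        hist = event_times[:i]
        # Estimate the integral of the CIF over (t_{i-1}, t_i] from uniform draws.
        draws = [rng.uniform(prev, t) for _ in range(samples_per_interval)]
        integral = (t - prev) * sum(intensity(u, hist) for u in draws) / samples_per_interval
        log_lik += math.log(intensity(t, hist)) - integral
        prev = t
    return log_lik


def linear_intensity(t, hist):
    """Toy history-independent CIF (an assumption for this check): lambda(t) = 1 + 0.5 t."""
    return 1.0 + 0.5 * t


times = [0.4, 1.1, 1.9, 3.0]
estimate = mc_log_likelihood(times, linear_intensity, 20000, random.Random(1))
# Closed form: the integral of 1 + 0.5 t over [0, 3] equals 3 + 0.25 * 3^2.
exact = sum(math.log(1.0 + 0.5 * t) for t in times) - (3.0 + 0.25 * 3.0 ** 2)
```

For a neural CIF no closed form is available for comparison, and each sampled point costs a forward pass through the network, which is part of the motivation for avoiding the normalizer altogether.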

Marked Temporal Point Process Framework

The proposed marked temporal point process framework and the principle of noise-contrastive estimation will be discussed in greater detail. In addition, the strong connection between the framework and exact MLE, which is often desired by existing MTPP models, will be shown. Moreover, the training process of the framework will also be discussed, including a novel adaptive noise generation algorithm. Finally, an instantiation of the framework with state-of-the-art deep learning techniques for modeling sequential data will be introduced.

MTPP with Noise-Contrastive Estimation

In noise-contrastive estimation, parameters are learned by solving a binary classification problem where samples are classified into two classes, namely true sample or noise sample. Here, true and noise samples refer to the events observed in the data distribution p_(d) and a specified noise distribution p_(n), respectively. Thus, p(y=1|τ) is defined to denote the probability that the event τ is a sample observed in p_(d). Similarly, p(y=0|τ) denotes the probability that the event τ is not observed in the data but generated from the noise distribution p_(n). Intuitively, the target is to maximize p(y=1|τ) for observed events and p(y=0|τ) for generated noise. Hence, the following objective function is obtained:

$\begin{matrix} {{{\underset{\theta}{\arg \; \max}\mspace{14mu} {\mathbb{E}}_{\tau \sim p_{d}}\mspace{14mu} \log \; {p\left( {y = \left. 1 \middle| \tau \right.} \right)}} + {K\; {\mathbb{E}}_{\tau \sim p_{n}}\mspace{14mu} \log \; {p\left( {y = \left. 0 \middle| \tau \right.} \right)}}},} & (4) \end{matrix}$

wherein K is the number of noise samples generated for each sample in the data. In MTPP, given the history sequence $\mathcal{H}_i$ and a random variable τ=(t,χ), t>t_(i), its posterior probability can be written as:

$\begin{matrix} {{{p\left( {{y = \left. 1 \middle| \tau \right.};\mathcal{H}_{i}} \right)} = \frac{p_{d}(\tau)}{{p_{d}(\tau)} + {{Kp}_{n}(\tau)}}},} & (5) \end{matrix}$

wherein $p_d(\tau)$ and $p_n(\tau)$ are short for $p_d(\tau | \mathcal{H}_i)$ and $p_n(\tau | \mathcal{H}_i)$, respectively. In detail, $p_d(\tau)$ denotes the probability of observing τ in the data. Similar to MLE, a family of parametric models $p_\theta(\tau)$ is used to approximate $p_d(\tau)$. Following the setting of NCE, instead of computing the normalizer as in Eq. (3), the normalizer can be replaced by a function $z_{\theta_z}(\cdot)$ learned from the data. The re-parametrization and implementation of $z_{\theta_z}$ will be introduced later. The likelihood function of the framework is formulated as follows:

$\begin{matrix} {{p_{\theta}(\tau)} = {{\lambda_{\theta_{\lambda}}(\tau)}{\exp\left( {z_{\theta_{z}}\left( \tau \middle| \mathcal{H}_{i} \right)} \right)}},} & (6) \end{matrix}$

where $\theta = \{\theta_\lambda, \theta_z\}$ is the model parameter of the framework. It should be mentioned that directly maximizing the likelihood function in Eq. (6) over the data distribution leads to trivial solutions when the normalizer $z_{\theta_z} \rightarrow -\infty$. With $p_\theta$ defined, Eq. (4) can be reformulated as:

$\begin{matrix} {{\mathcal{L}\left( \theta \middle| \mathcal{H}_{i} \right)} = {{- {_{\tau \sim {p_{d}{(\tau)}}}\left\lbrack {\log \frac{p_{\theta}(\tau)}{{p_{\theta}(\tau)} + {{Kp}_{n}(\tau)}}} \right\rbrack}} - {K\; {{_{\tau \sim {p_{n}{(\tau)}}}\left\lbrack {\log \frac{\; {p_{n}(\tau)}}{{p_{\theta}(\tau)} + {{Kp}_{n}(\tau)}}} \right\rbrack}.}}}} & (7) \end{matrix}$

Given the j-th element of θ as θ_(j), the partial gradient of Eq. (7) against θ_(j) is:

$\begin{matrix} {\frac{\partial{\mathcal{L}(\theta)}}{\partial\theta_{j}} = {{- {_{\tau \sim {p_{d}{(\tau)}}}\left\lbrack {\frac{{Kp}_{n}(\tau)}{{p_{\theta}(\tau)} + {{Kp}_{n}(\tau)}} \times \frac{{\partial\log}\; {p_{\theta}(\tau)}}{\partial\theta_{j}}} \right\rbrack}} + {K\; {{_{\tau \sim {p_{n}{(\tau)}}}\left\lbrack {\frac{p_{\theta}(\tau)}{{p_{\theta}(\tau)} + {{Kp}_{n}(\tau)}} \times \frac{{\partial\log}\; {p_{\theta}(\tau)}}{\partial\theta_{j}}} \right\rbrack}.}}}} & (8) \end{matrix}$

Then it is natural to ask if there are connections between the framework and the existing training framework based on MLE. In the following theorem, it is shown that they are inherently connected by the partial gradients.

Theorem 1. The partial gradients of the loss function of the framework (Eq. (8)) converge to those under the MLE framework as the number of noise samples per true sample K→+∞, under the following two mild assumptions: (1) the gradient $\nabla_\theta p_\theta$ exists for all θ; (2) there exists an integrable function R(τ) which is an upper bound of $\max_j \frac{\partial p_\theta(\tau)}{\partial \theta_j}$.

Proof. Given $\mathcal{H}_i$ and τ=(t, χ), t>t_(i), the definition of expectation $\mathbb{E}_{\tau \sim p}[f(\tau)] = \int_{\chi \in \mathcal{X}} \int_{t_i}^{+\infty} p(\tau) f(\tau)\, dt\, d\chi$ is used to expand Eq. (8) as:

$\begin{matrix} {\frac{\partial{\mathcal{L}(\theta)}}{\partial\theta_{j}} = {- {\int_{\chi \in \mathcal{X}}{\int_{t_{i}}^{+ \infty}{\frac{{{Kp}_{n}(\tau)}\left( {{p_{d}(\tau)} - {p_{\theta}(\tau)}} \right)}{{p_{\theta}(\tau)} + {{Kp}_{n}(\tau)}}\frac{{\partial\log}\; {p_{\theta}(\tau)}}{\partial\theta_{j}}{dt}\,{{d\chi}.}}}}}} & (9) \end{matrix}$

When K→+∞, we have

${\frac{{Kp}_{n}(\tau)}{{p_{\theta}(\tau)} + {{Kp}_{n}(\tau)}}\rightarrow 1},$

thus:

$\begin{matrix} {{\lim\limits_{K\rightarrow{+ \infty}}\frac{\partial{\mathcal{L}(\theta)}}{\partial\theta_{j}}} = {- {\int_{\chi \in \mathcal{X}}{\int_{t_{i}}^{+ \infty}{\left( {{p_{d}(\tau)} - {p_{\theta}(\tau)}} \right)\frac{{\partial\log}\; {p_{\theta}(\tau)}}{\partial\theta_{j}}{dt}\,{{d\chi}.}}}}}} & (10) \end{matrix}$

Then the second term in Eq. (10) is shown to vanish for all j. Since $p_\theta(\tau) \frac{\partial \log p_\theta(\tau)}{\partial \theta_j} = \frac{\partial p_\theta(\tau)}{\partial \theta_j}$,

$\begin{matrix} {{{\int_{\chi \in \mathcal{X}}{\int_{t_{i}}^{+ \infty}{\frac{\partial{p_{\theta}(\tau)}}{\partial\theta_{j}}{dt}\,{d\chi}}}} = {\frac{\partial{\int_{\chi \in \mathcal{X}}{\int_{t_{i}}^{+ \infty}{{p_{\theta}(\tau)}{dt}\,{d\chi}}}}}{\partial\theta_{j}} = {\frac{\partial 1}{\partial\theta_{j}} = 0}}},} & (11) \end{matrix}$

wherein the Leibniz rule is used to swap the order of partial differentiation and integration. Moreover, it is known that $\int_{\chi \in \mathcal{X}} \int_{t_i}^{+\infty} p_\theta(\tau)\, dt\, d\chi = 1$ because the likelihood $p_\theta(\tau)$ is a well-defined probability density function. Therefore, in Eq. (10) what is left is:

$\begin{matrix} \begin{matrix} {{\lim\limits_{K\rightarrow{+ \infty}}\frac{\partial{\mathcal{L}(\theta)}}{\partial\theta_{j}}} = {- {\int_{\chi \in \mathcal{X}}{\int_{t_{i}}^{+ \infty}{{p_{d}(\tau)}\frac{{\partial\log}\; {p_{\theta}(\tau)}}{\partial\theta_{j}}{dt}\,{d\chi}}}}}} \\ {{= {- {{\mathbb{E}}_{\tau \sim p_{d}}\left\lbrack \frac{{\partial\log}\; {p_{\theta}(\tau)}}{\partial\theta_{j}} \right\rbrack}}},} \end{matrix} & (12) \end{matrix}$

which is the expectation over the data distribution of the gradient of the negative log-likelihood, i.e., the gradient of the loss minimized under MLE. This completes the proof.
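The statement of Theorem 1 can be checked numerically on a toy problem. The sketch below rests on loud assumptions that are not part of the disclosure: the event space is replaced by a finite support of four outcomes so that expectations become exact sums, the model is a normalized softmax, and the noise distribution is uniform. It compares the gradient of the NCE loss in Eq. (8) with the gradient of the negative log-likelihood as K grows.

```python
import math


def softmax(theta):
    m = max(theta)
    e = [math.exp(v - m) for v in theta]
    s = sum(e)
    return [v / s for v in e]


def nce_gradient(theta, p_data, p_noise, K):
    """Exact gradient of the NCE loss (Eq. (8)) on a finite support."""
    p_model = softmax(theta)
    n = len(theta)
    grad = [0.0] * n
    for x in range(n):
        denom = p_model[x] + K * p_noise[x]
        w_data = K * p_noise[x] / denom   # weight inside the expectation over p_d
        w_noise = p_model[x] / denom      # weight inside the expectation over p_n
        for j in range(n):
            dlogp = (1.0 if x == j else 0.0) - p_model[j]  # d log p_theta(x) / d theta_j
            grad[j] += (-p_data[x] * w_data + K * p_noise[x] * w_noise) * dlogp
    return grad


def nll_gradient(theta, p_data):
    """Gradient of the negative log-likelihood, -E_{p_d}[d log p_theta / d theta_j]."""
    p_model = softmax(theta)
    return [p_model[j] - p_data[j] for j in range(len(theta))]


theta = [0.3, -0.2, 0.1, 0.0]
p_data = [0.4, 0.3, 0.2, 0.1]
p_noise = [0.25, 0.25, 0.25, 0.25]
target = nll_gradient(theta, p_data)


def max_gap(K):
    g = nce_gradient(theta, p_data, p_noise, K)
    return max(abs(a - b) for a, b in zip(g, target))


gaps = {K: max_gap(K) for K in (1, 10, 100, 10000)}
```

In this toy setting the entries of `gaps` shrink as K increases, in line with the convergence stated in the theorem. The sketch says nothing about how large K must be for a neural MTPP model on real data.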

Therefore, with a reasonable K and a proper p_(n), replacing the objective of exact MLE (Eq. (3)) with that of the framework, namely Eq. (7), does not significantly affect the gradients for model parameters θ in the learning process.

Next, a re-parametrization trick is introduced for the framework which can also be adapted to other NCE frameworks. With this trick, the strong assumptions of negative sampling are avoided: (1) p_(n) is independent of the history $\mathcal{H}_i$; (2) p_(n) is a uniform distribution s.t. Kp_(n)=1. Specifically, Eq. (4) can be rewritten as follows with $z' = z'(\tau | \mathcal{H}_i) = \frac{\exp(z_{\theta_z})}{K p_n(\tau)}$:

$\begin{matrix} {{\underset{\theta}{\arg \; \min}\mspace{14mu} {\mathcal{L}(\theta)}} = {- {\sum\limits_{\tau}{\sum\limits_{j = 1}^{i - 1}{\left\lbrack {{\log \frac{{\lambda_{\theta_{\lambda}}\left( \tau_{j} \right)}z^{\prime}}{{{\lambda_{\theta_{\lambda}}\left( \tau_{j} \right)}z^{\prime}} + 1}} + {\sum\limits_{k = 1}^{K}{\log \frac{1}{{{\lambda_{\theta_{\lambda}}\left( \tau_{j,k}^{\prime} \right)}z^{\prime}} + 1}}}} \right\rbrack.}}}}} & (13) \end{matrix}$

With the aforementioned re-parametrization trick, z′ can be learned directly instead of $z_{\theta_z}$. Thus, p_(n)(τ) does not need to be explicitly computed. This sidesteps the constraint that p_(n) requires an analytical expression, and further expands the functional space in which to search for the optimal noise distribution p_(n).
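The equivalence behind the re-parametrization can be verified numerically for a single observed event and its noise samples. The sketch below is an illustration only: the function name `reparam_loss` and all numeric values are hypothetical, and it assumes that z′ is evaluated separately at each event, following the definition z′ = z′(τ|ℋ_i).

```python
import math


def reparam_loss(lam_true, lam_noise, z_prime_true, z_prime_noise):
    """Summand of Eq. (13) for one observed event and its K noise samples.

    lam_true:      lambda(tau_j) for the observed event
    lam_noise:     [lambda(tau'_{j,k})] for the K noise samples
    z_prime_true:  z' evaluated at the observed event
    z_prime_noise: [z'] evaluated at each noise sample
    """
    a = lam_true * z_prime_true
    loss = -math.log(a / (a + 1.0))
    for lam, zp in zip(lam_noise, z_prime_noise):
        loss -= math.log(1.0 / (lam * zp + 1.0))
    return loss


K = 2
z = -0.7                                 # hypothetical value of z_{theta_z}
lam_true, pn_true = 1.5, 0.3             # hypothetical intensity and noise density
lam_noise, pn_noise = [0.4, 0.9], [0.5, 0.2]

# Re-parametrized form: only the product lambda * z' is needed.
zp_true = math.exp(z) / (K * pn_true)
zp_noise = [math.exp(z) / (K * pn) for pn in pn_noise]
loss_reparam = reparam_loss(lam_true, lam_noise, zp_true, zp_noise)

# Explicit form with p_theta = lambda * exp(z) (Eq. (6)) and an explicit p_n.
p_true = lam_true * math.exp(z)
loss_explicit = -math.log(p_true / (p_true + K * pn_true))
for lam, pn in zip(lam_noise, pn_noise):
    p = lam * math.exp(z)
    loss_explicit -= math.log(K * pn / (p + K * pn))
```

Both forms give the same value up to floating-point error. In training, z′ would be produced directly by the model, so the explicit noise density used in the second half is needed here only for the comparison.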

Adaptive Noise Sample Generation

The framework enables the training of complex MTPP models with the principle of NCE. Nonetheless, the development of a sophisticated noise sample generation algorithm is still in its infancy. A novel algorithm for adaptive noise sample generation is discussed herein. The algorithm facilitates the training process of the framework, where at least one noise event τ′_(i,k) has to be generated for an observed event τ_(i). As p_(d) is a continuous joint distribution of time t and mark χ, it is much more challenging to work out an intuitive p_(n) than in the case of neural language models, where p_(n) can be a simple univariate deterministic function of word frequency. It has been previously argued that p_(n) should be close to p_(d) because the more difficult the classification problem in Eq. (4) is, the more information the model p_(θ) can capture from the data distribution p_(d). Without arbitrary assumptions on p_(n), a principled way is proposed for adaptive noise generation. The algorithm adaptively pushes the implicit noise distribution p_(n) towards p_(d) as p_(θ) captures more information from p_(d).

The key intuition of this algorithm is that, in the l-th iteration of the training process, the current MTPP model p_(θ) may not be good enough, so it can be used to generate noise samples:

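$\begin{matrix} {{t_{{i + 1},k}^{\prime} \sim {{\hat{t}}_{i + 1} + {\mathcal{N}\left( {0,{l\sigma_{0}^{2}}} \right)}}},\mspace{14mu} {\chi_{{i + 1},k}^{\prime} = {\hat{\chi}}_{i + 1}},} & (14) \end{matrix}$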

where $\hat{t}_{i+1}$ and $\hat{\chi}_{i+1}$ are the predicted time and mark for the (i+1)-th event based on $\mathcal{H}_i$ and p_(θ). For example, in conventional MTPP models, S examples can be sampled by $\hat{\tau}_{i+1,j} = (\hat{t}_{i+1,j}, \hat{\chi}_{i+1,j}) \sim p_\theta(\tau | \mathcal{H}_i)$, j=1, . . . , S, and predictions can be made by estimating expectations:

${\hat{t}}_{i + 1} = {{{\mathbb{E}}\left\lbrack t_{i + 1} \right\rbrack} = {\frac{1}{S}{\sum\limits_{j}{\hat{t}}_{{i + 1},j}}}}\mspace{14mu} {and}\mspace{14mu} {\hat{\chi}}_{i + 1} = {{{\mathbb{E}}\left\lbrack \chi_{i + 1} \right\rbrack} = {\frac{1}{S}{\sum\limits_{j}{{\hat{\chi}}_{{i + 1},j}.}}}}$

Algorithm 1 Adaptive noise generation for INITIATOR.
Input: $\mathcal{H}_i$, p_(θ)
1: Compute prediction $\hat{\tau}_{i+1} = (\hat{t}_{i+1}, \hat{\chi}_{i+1})$
2: for k = 1 to K do
3:   Sample τ′_(i+1,k) by Eq. (14)
4: end for
5: return (τ′_(i+1,1), ..., τ′_(i+1,K))

How the predictions can be made with neural MTPP models is discussed herein. The adaptive Gaussian noise is added to ensure that good predictions are not treated as noise samples. The variance increases with the iteration number l because the model p_(θ) makes better predictions as the training process continues.
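Algorithm 1 and the sampling-based prediction step can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the stand-in `sampler` that plays the role of drawing from p_θ(τ|ℋ_i), and the parameter values are all hypothetical and chosen only for illustration.

```python
import math
import random


def predict_by_sampling(sampler, S):
    """Average S draws (t, chi) from a model sampler to estimate the next time and mark."""
    draws = [sampler() for _ in range(S)]
    t_hat = sum(t for t, _ in draws) / S
    d = len(draws[0][1])
    x_hat = [sum(x[j] for _, x in draws) / S for j in range(d)]
    return t_hat, x_hat


def generate_noise(t_hat, x_hat, K, iteration, sigma0, rng):
    """Algorithm 1: K noise events around the current prediction (Eq. (14)).

    The time is perturbed with Gaussian noise of variance iteration * sigma0**2;
    the mark is copied from the prediction.
    """
    std = math.sqrt(iteration) * sigma0
    return [(t_hat + rng.gauss(0.0, std), list(x_hat)) for _ in range(K)]


rng = random.Random(0)


def sampler():
    # Hypothetical stand-in for sampling from p_theta(tau | H_i).
    return 5.0 + rng.expovariate(1.0), [rng.random()]


t_hat, x_hat = predict_by_sampling(sampler, 1000)
noise = generate_noise(t_hat, x_hat, K=5, iteration=3, sigma0=0.1, rng=rng)
```

Each noise event shares the predicted mark and differs from the prediction only in its timestamp. Because the standard deviation grows with the square root of the iteration count, later iterations place noise times further from the prediction on average.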

A Neural MTPP Model Based on the Framework

An instantiation of the framework with state-of-the-art deep learning models (a.k.a. neural MTPP) is now introduced. Compared with conventional models, neural MTPP models handle sequential event data by vector representations. Specifically, a neural MTPP model maps the observed history of a sequence $\mathcal{H}_i$ to a vector representation h_(i). In the designed model, dense layers are used to project the raw input into a multi-dimensional space. Then, Long Short Term Memory (LSTM) is used to capture the nonlinear dependencies between τ_(i) and $\mathcal{H}_{i-1}$, i=2, . . . , N. Consequently, the output of the LSTM is regarded as the vector representation. Given an input event τ_(i) and the corresponding noise samples τ′_(i,k), k=1, . . . , K, the neural MTPP model is formulated as set forth below:

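$\begin{matrix} {{s_{i} = \left\lbrack {{\phi_{t}\left( {{w_{t}t_{i}} + b_{t}} \right)},{\phi_{\chi}\left( {{W_{\chi}\chi_{i}} + b_{\chi}} \right)}} \right\rbrack},\mspace{14mu} {\left( {h_{i},c_{i}} \right) = {{LSTM}\left( {s_{i},h_{i - 1},c_{i - 1}} \right)}.}} & (15) \end{matrix}$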

To train this model, an output model is required to map h_(i) to a scalar y, which can be the conditional intensity $\lambda^{t}(t_i | \mathcal{H}_{i-1})$, the ground conditional intensity $\lambda(\tau_i | \mathcal{H}_{i-1})$, the predicted time $\hat{t}_{i+1}$ or the predicted mark $\hat{\chi}_{i+1}$. Hence, the CIF of a neural MTPP model can be decomposed as $\lambda_{\theta_\lambda}(\tau | \mathcal{H}_{i-1}) = g^{\lambda}(f(\mathcal{H}_i))$. In the designed model, a dense layer performs the function of an output model g^(λ), mapping the vector representation to the conditional intensity. Then the output model can be framed as:

λ(τ_(i))=g ^(λ)(h _(i))=ϕ_(out)(w _(out) h _(i) +b _(out))  (16)

To compute the loss function of the framework, similar to the Siamese structure, dense layers and recurrent layers are shared between inputs—observed event τ_(i) and its noise τ′_(i,k). Finally, the conditional intensity of a true event λ(τ_(i)) and that for its noise samples λ(τ′_(i,k)) may be fed into the loss function of the framework (Eq. (13)).
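The forward pass described by Eqs. (15) and (16), together with the weight sharing between an observed event and its noise samples, can be sketched without any deep learning library. Everything in the sketch is an assumption made for illustration: the class name `TinyNeuralMTPP`, the layer sizes, the tanh activations in the embedding layers, the softplus output activation and the random initialization are not specified in the text, and no training step is shown.

```python
import math
import random


def dense(W, b, x, act):
    """act(W x + b) for a weight matrix W given as a list of rows."""
    return [act(sum(w * v for w, v in zip(row, x)) + bi) for row, bi in zip(W, b)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def softplus(v):
    return math.log1p(math.exp(v))


def init(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TinyNeuralMTPP:
    def __init__(self, mark_dim, embed_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.W_t, self.b_t = init(embed_dim, 1, rng), [0.0] * embed_dim
        self.W_x, self.b_x = init(embed_dim, mark_dim, rng), [0.0] * embed_dim
        in_dim = 2 * embed_dim + hidden_dim
        # One weight matrix per LSTM gate: input (i), forget (f), output (o), candidate (g).
        self.gates = {g: (init(hidden_dim, in_dim, rng), [0.0] * hidden_dim) for g in "ifog"}
        self.W_out, self.b_out = init(1, hidden_dim, rng), [0.0]

    def step(self, t, x, h, c):
        """Eq. (15): embed (t, chi) into s_i, then apply one LSTM step."""
        s = dense(self.W_t, self.b_t, [t], math.tanh) + dense(self.W_x, self.b_x, x, math.tanh)
        inp = s + h
        i = dense(*self.gates["i"], inp, sigmoid)
        f = dense(*self.gates["f"], inp, sigmoid)
        o = dense(*self.gates["o"], inp, sigmoid)
        g = dense(*self.gates["g"], inp, math.tanh)
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new

    def intensity(self, h):
        """Eq. (16): dense output layer; softplus (assumed) keeps the intensity positive."""
        return dense(self.W_out, self.b_out, h, softplus)[0]


model = TinyNeuralMTPP(mark_dim=1, embed_dim=4, hidden_dim=8)
h, c = [0.0] * 8, [0.0] * 8
for t, x in [(0.5, [1.0]), (1.2, [0.0])]:
    h, c = model.step(t, x, h, c)           # encode the observed history
h_true, _ = model.step(2.0, [0.0], h, c)    # observed next event
h_noise, _ = model.step(2.3, [0.0], h, c)   # one noise event, same weights and history
lam_true, lam_noise = model.intensity(h_true), model.intensity(h_noise)
```

The observed event and the noise event pass through the same embedding, recurrent and output weights, starting from the same history state, which mirrors the Siamese-style sharing described above. The two resulting intensities are the quantities that would enter the loss of Eq. (13).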

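Then, how adaptive noise generation is employed will be discussed. According to Algorithm 1, given the vector representation h_(i) and output models $g^{t}$, $g^{\chi} = [g^{\chi^{1}}, \ldots, g^{\chi^{d}}]$ trained to predict the (i+1)-th event, a noise sample τ′_(i+1,k) in the l-th iteration is generated by:

$\begin{matrix} {{t_{{i + 1},k}^{\prime} \sim {{g^{t}\left( h_{i} \right)} + {\mathcal{N}\left( {0,{l\sigma_{0}^{2}}} \right)}}},\mspace{14mu} {\chi_{{i + 1},k}^{\prime} = {g^{\chi}\left( h_{i} \right)}.}} & (17) \end{matrix}$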

Computing Device

FIG. 5 illustrates an example of a suitable computing device 100 which may be used to implement various aspects of the MTPP framework with adaptive noise sample generation described herein. More particularly, in some embodiments, aspects of the MTPP framework with adaptive noise sample generation may be translated to software or machine-level code, which may be installed to and/or executed by the computing device 100 such that the computing device 100 is configured to train one or more MTPP models to better estimate how likely it is that an event will occur in the future. It is contemplated that the computing device 100 may include any number of devices, such as personal computers, server computers, hand-held or laptop devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronic devices, network PCs, minicomputers, mainframe computers, digital signal processors, state machines, logic circuitries, distributed computing environments, and the like.

The computing device 100 may include various hardware components, such as a processor 102, a main memory 104 (e.g., a system memory), and a system bus 101 that couples various components of the computing device 100 to the processor 102. The system bus 101 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. For example, such architectures may include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus also known as Mezzanine bus.

The computing device 100 may further include a variety of memory devices and computer-readable media 107 that includes removable/non-removable media and volatile/nonvolatile media and/or tangible media, but excludes transitory propagated signals. Computer-readable media 107 may also include computer storage media and communication media. Computer storage media includes removable/non-removable media and volatile/nonvolatile media implemented in any method or technology for storage of information, such as computer-readable instructions, data structures, program modules or other data, such as RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that may be used to store the desired information/data and which may be accessed by the general purpose computing device. Communication media includes computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. For example, communication media may include wired media such as a wired network or direct-wired connection and wireless media such as acoustic, RF, infrared, and/or other wireless media, or some combination thereof. Computer-readable media may be embodied as a computer program product, such as software stored on computer storage media.

The main memory 104 includes computer storage media in the form of volatile/nonvolatile memory such as read only memory (ROM) and random access memory (RAM). A basic input/output system (BIOS), containing the basic routines that help to transfer information between elements within the general purpose computing device (e.g., during start-up) is typically stored in ROM. RAM typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processor 102. Further, a data storage 106 stores an operating system, application programs, and other program modules and program data.

The data storage 106 may also include other removable/non-removable, volatile/nonvolatile computer storage media. For example, data storage 106 may be: a hard disk drive that reads from or writes to non-removable, nonvolatile magnetic media; a magnetic disk drive that reads from or writes to a removable, nonvolatile magnetic disk; and/or an optical disk drive that reads from or writes to a removable, nonvolatile optical disk such as a CD-ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media may include magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The drives and their associated computer storage media provide storage of computer-readable instructions, data structures, program modules and other data for the general purpose computing device 100.

A user may enter commands and information through a user interface 140 (displayed via a monitor 160) by engaging input devices 145 such as a tablet, electronic digitizer, a microphone, keyboard, and/or pointing device, commonly referred to as mouse, trackball or touch pad. Other input devices 145 may include a joystick, game pad, satellite dish, scanner, or the like. Additionally, voice inputs, gesture inputs (e.g., via hands or fingers), or other natural user input methods may also be used with the appropriate input devices, such as a microphone, camera, tablet, touch pad, glove, or other sensor. These and other input devices 145 are in operative connection to the processor 102 and may be coupled to the system bus 101, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 160 or other type of display device is also connected to the system bus 101. The monitor 160 may also be integrated with a touch-screen panel or the like.

The computing device 100 may be implemented in a networked or cloud-computing environment using logical connections of a network interface 103 to one or more remote devices, such as a remote computer. The remote computer may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the general purpose computing device. The logical connection may include one or more local area networks (LAN) and one or more wide area networks (WAN), but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.

When used in a networked or cloud-computing environment, the computing device 100 may be connected to a public and/or private network through the network interface 103. In such embodiments, a modem or other means for establishing communications over the network is connected to the system bus 101 via the network interface 103 or other appropriate mechanism. A wireless networking component including an interface and antenna may be coupled through a suitable device such as an access point or peer computer to a network. In a networked environment, program modules depicted relative to the general purpose computing device, or portions thereof, may be stored in the remote memory storage device.

Computing System

Referring to FIG. 6, in some embodiments the MTPP framework with adaptive noise generation described herein may be implemented at least in part by way of a computing system 200. In general, the computing system 200 may include a plurality of components, and may include at least one computing device 202, which may be equipped with at least one or more of the features of the computing device 100 described herein. As indicated, the computing device 202 may be configured to implement an MTPP framework 204 which may include a training module 206 for training one or more MTPP models and an adaptive noise sample generation algorithm 208 to facilitate the training. Aspects of the MTPP framework 204 may be implemented as code and/or machine-executable instructions executable by the computing device 202 that may represent one or more of a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a class, or any combination of instructions, data structures, or program statements related to the above MTPP model training methods. A code segment of the MTPP framework 204 may be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, parameters, or memory contents. Information, arguments, parameters, data, etc. may be passed, forwarded, or transmitted via any suitable means including memory sharing, message passing, token passing, network transmission, or the like.

In other words, aspects of the MTPP framework may be implemented by hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the necessary tasks (e.g., a computer-program product) may be stored in a computer-readable or machine-readable medium, and a processor(s) associated with the computing device 202 may perform the tasks defined by the code.

As further shown, the system 200 may include at least one internet connected device 210 in operable communication with the computing device 202. In some embodiments, the internet connected device 210 may provide sequential event data 212 to the computing device 202 for training purposes or real world prediction of future events. The internet connected device 210 may include any electronic device capable of accessing/tracking sequential event data such as social media activity over time. In addition, the system 200 may include a client application 220 which may be configured to provide aspects of the MTPP framework 204 to any number of client devices 222 via a network 224, such as the Internet, a local area network, a wide area network, a cloud environment, and the like.

Example embodiments described herein may be implemented at least in part in electronic circuitry; in computer hardware executing firmware and/or software instructions; and/or in combinations thereof. Example embodiments also may be implemented using a computer program product (e.g., a computer program tangibly or non-transitorily embodied in a machine-readable medium and including instructions for execution by, or to control the operation of, a data processing apparatus, such as, for example, one or more programmable processors or computers). A computer program may be written in any form of programming language, including compiled or interpreted languages, and may be deployed in any form, including as a stand-alone program or as a subroutine or other unit suitable for use in a computing environment. Also, a computer program can be deployed to be executed on one computer, or to be executed on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.

Certain embodiments are described herein as including one or more modules 112. Such modules 112 are hardware-implemented, and thus include at least one tangible unit capable of performing certain operations and may be configured or arranged in a certain manner. For example, a hardware-implemented module 112 may comprise dedicated circuitry that is permanently configured (e.g., as a special-purpose processor, such as a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC)) to perform certain operations. A hardware-implemented module 112 may also comprise programmable circuitry (e.g., as encompassed within a general-purpose processor or other programmable processor) that is temporarily configured by software or firmware to perform certain operations. In some example embodiments, one or more computer systems (e.g., a standalone system, a client and/or server computer system, or a peer-to-peer computer system) or one or more processors may be configured by software (e.g., an application or application portion) as a hardware-implemented module 112 that operates to perform certain operations as described herein.

Accordingly, the term “hardware-implemented module” encompasses a tangible entity, be that an entity that is physically constructed, permanently configured (e.g., hardwired), or temporarily configured (e.g., programmed) to operate in a certain manner and/or to perform certain operations described herein. Considering embodiments in which hardware-implemented modules 112 are temporarily configured (e.g., programmed), each of the hardware-implemented modules 112 need not be configured or instantiated at any one instance in time. For example, where the hardware-implemented modules 112 comprise a general-purpose processor configured using software, the general-purpose processor may be configured as respective different hardware-implemented modules 112 at different times. Software may accordingly configure a processor 102, for example, to constitute a particular hardware-implemented module at one instance of time and to constitute a different hardware-implemented module 112 at a different instance of time.

Hardware-implemented modules 112 may provide information to, and/or receive information from, other hardware-implemented modules 112. Accordingly, the described hardware-implemented modules 112 may be regarded as being communicatively coupled. Where multiple of such hardware-implemented modules 112 exist contemporaneously, communications may be achieved through signal transmission (e.g., over appropriate circuits and buses) that connect the hardware-implemented modules. In embodiments in which multiple hardware-implemented modules 112 are configured or instantiated at different times, communications between such hardware-implemented modules may be achieved, for example, through the storage and retrieval of information in memory structures to which the multiple hardware-implemented modules 112 have access. For example, one hardware-implemented module 112 may perform an operation, and may store the output of that operation in a memory device to which it is communicatively coupled. A further hardware-implemented module 112 may then, at a later time, access the memory device to retrieve and process the stored output. Hardware-implemented modules 112 may also initiate communications with input or output devices.

Experiments

As discussed herein, experiments were conducted to evaluate the performance of the framework. In particular, an attempt was made to answer the following two research questions: (1) how accurately can the framework predict the exact timestamp of an event; and (2) how accurately can the framework predict what type of event may occur in the near future. Before the details of the experiments are presented, the datasets and experimental settings are introduced.

Dataset Description

Three real-world sequential event datasets were collected to answer the above two proposed research questions.

Citi Bike.

Citi Bike shares bikes at stations across New York and New Jersey. The activities for a certain bike form a sequence of events. The training set and test set contain the records of the bikes in Jersey City from January to August 2017 and those of September 2017, respectively. The task is to predict the destination of the next ride and its arrival time.

Retweet.

10,000 retweet streams were randomly sampled from the Seismic dataset, and a 5-fold cross-validation was performed. Each stream of retweets for a novel tweet is a sequence of events. The task is to predict the retweet time and the associated class label.

Financial.

This dataset contains sequences of financial events from a stock traded in the US. To avoid bias in the original dataset, all sequences are given the same length by using the first 800 events of each sequence. Then a 5-fold cross-validation is carried out. The task is to predict the time and mark (buy or sell) of the next event.
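By way of non-limiting illustration, the truncation and 5-fold cross-validation described above may be sketched as follows. The function names `truncate_sequences` and `k_fold_splits`, the random seed, and the choice to discard sequences shorter than 800 events are assumptions made for this sketch and are not stated in the disclosure.

```python
import random


def truncate_sequences(sequences, max_events=800):
    """Keep the first max_events events of each sequence.

    Assumption: sequences with fewer than max_events events are discarded
    so that all remaining sequences have the same length.
    """
    return [seq[:max_events] for seq in sequences if len(seq) >= max_events]


def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Fold j receives every k-th shuffled index, starting at position j.
    folds = [indices[j::k] for j in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for j, fold in enumerate(folds) if j != held_out
                 for idx in fold]
        yield train, test
```

With 1,000 sequences of 800 events each, every fold in this sketch yields 800 training sequences and 200 test sequences.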

In these datasets, each event comes with only a discrete unidimensional mark. The statistics of the datasets are shown in Table 1: the mean and standard deviation of the time interval between consecutive events (μ_(t) and σ_(t)), the number of unique values for a mark (|X|), the average sequence length (μ_(M)) and the number of events for training and test.

TABLE 1 Statistics of the datasets

Dataset            Citi Bike      Retweet        Financial
μ_(t)              8.135e2 (s)    3.228e4 (s)    0.619 (ms)
σ_(t)              1.157e4 (s)    7.839e4 (s)    3.117 (ms)
|X|                132            3              2
μ_(M)              1.839e2        1.458e2        8.00e2
training events    3.299e5        1.468e6        6.400e5
test events        3.299e4        3.716e5        1.600e5
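As a minimal sketch, statistics of the kind reported in Table 1 may be computed from raw sequences as shown below. The representation of each event as a `(time, mark)` tuple, the function name `dataset_statistics`, and the use of the population standard deviation are assumptions made for illustration only.

```python
import statistics


def dataset_statistics(sequences):
    """Summarize a list of event sequences.

    Each sequence is assumed to be a time-ordered list of (time, mark) tuples.
    """
    # Time intervals between consecutive events within each sequence.
    gaps = [later[0] - earlier[0]
            for seq in sequences
            for earlier, later in zip(seq, seq[1:])]
    marks = {mark for seq in sequences for _, mark in seq}
    lengths = [len(seq) for seq in sequences]
    return {
        "mu_t": statistics.mean(gaps),
        "sigma_t": statistics.pstdev(gaps),  # population std (assumption)
        "num_marks": len(marks),
        "mu_M": statistics.mean(lengths),
        "num_events": sum(lengths),
    }


toy = [[(0.0, "a"), (2.0, "b"), (6.0, "a")], [(1.0, "b"), (4.0, "b")]]
stats = dataset_statistics(toy)
```

For the toy input, the intervals are 2, 4 and 3, so `stats["mu_t"]` is 3.0 and `stats["mu_M"]` is 2.5.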

Experimental Settings

Training is carried out with mini-batches, while experimental results are reported on the whole test set. All experiments are repeated 10 times. Adam is used as the optimizer. In addition, ReLU was selected as the activation function (ϕ_(t), ϕ_(χ) and ϕ_(out)). In terms of initialization, the cell state of the LSTM is set to 0, the weights of the LSTM are drawn from a truncated normal distribution, and the weights of the dense layers use the Xavier initialization. Grid search is used to find the optimal hyperparameters. Specifically, the learning rate was searched in {0.01, 0.001, 0.0001}, the number of units in dense layers in {1, 10, 100}, the LSTM state cell size in {32, 64, 128, 256}, the batch size in {16, 32, 64} and the number of noise samples per true event in {1, 2, 5, 10}. Three strategies were adopted for the re-parametrized normalizer z′: (1) z′=1 was set as a constant; (2) z′ was set as a single parameter to learn, which is also independent of ℋ_(i); and (3) z′=g^(z)(h_(i)) was learned as a function of the vector representation of ℋ_(i).
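The grid search over the hyperparameter values listed above may be sketched as follows. The dictionary keys, the function names `iter_configs` and `grid_search`, and the user-supplied `evaluate` callable (assumed to train a model and return a validation score where lower is better) are hypothetical and are introduced only for illustration.

```python
from itertools import product

# Candidate values taken from the experimental settings; key names are
# hypothetical labels chosen for this sketch.
SEARCH_SPACE = {
    "learning_rate": [0.01, 0.001, 0.0001],
    "dense_units": [1, 10, 100],
    "lstm_state_size": [32, 64, 128, 256],
    "batch_size": [16, 32, 64],
    "noise_samples_per_event": [1, 2, 5, 10],
}


def iter_configs(space):
    """Yield every combination of hyperparameter values as a dict."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(space, evaluate):
    """Return the configuration with the lowest score from evaluate(config)."""
    return min(iter_configs(space), key=evaluate)
```

Under these candidate sets the grid contains 3 × 3 × 4 × 3 × 4 = 432 configurations, each of which would be trained and scored once per repetition in such a sketch.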

Baselines. To assess the effectiveness of the framework, the framework was compared with the following variants and state-of-the-art frameworks for training neural MTPP models. For a fair comparison, the same input layers, recurrent layers and output layers on vector representations for time, mark, CIF, and ground CIF are used. It is worthwhile to note that TPP models such as Seismic cannot be considered as baselines because of their inability to model mark types along with timing information.

-   NCE-P: A variant of the framework in which t′_(i,k) is sampled from a homogeneous Poisson process.
-   NCE-G: A variant of the framework in which t′_(i,k) is sampled from Gaussian distributions.
-   DIS: DIS trains an MTPP model with discriminative loss functions, i.e., MSE on time and cross entropy on marks.
-   MLE: With an integrable ground CIF, MLE maximizes the likelihood of time exactly and minimizes the cross entropy on marks.
-   MCMLE: MCMLE trains an MTPP model by maximizing likelihood approximately through Monte Carlo sampling.

Evaluation Metrics:

For time prediction, we evaluate different methods by the root mean squared error (RMSE) and the mean absolute error (MAE), which are widely adopted to measure the performance of regression algorithms. For mark prediction, as we only have unidimensional discrete marks in the datasets, the performance is measured through two widely used metrics for classification: micro-F1 and macro-F1.
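For reference, the four metrics named above may be computed as in the following minimal sketch. The function names are hypothetical, and averaging macro-F1 over the union of true and predicted labels is an assumption of this sketch rather than a detail stated in the disclosure.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted times."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted times."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label mark prediction."""
    labels = set(y_true) | set(y_pred)
    tp = dict.fromkeys(labels, 0)
    fp = dict.fromkeys(labels, 0)
    fn = dict.fromkeys(labels, 0)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted mark p was wrong
            fn[t] += 1  # true mark t was missed

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro-F1 pools the counts; macro-F1 averages the per-label F1 values.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


time_rmse = rmse([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 8.0])  # 2.0
time_mae = mae([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 8.0])    # 1.0
micro_f1, macro_f1 = f1_scores(["a", "a", "b", "c"], ["a", "b", "b", "c"])
```

In the toy time example a single error of 4 among four predictions gives an RMSE of 2.0 against an MAE of 1.0, which illustrates how RMSE weights large errors more heavily.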

Experimental Results and Discussion

Experiments were conducted with the three aforementioned datasets on two research tasks: (1) time prediction; and (2) mark prediction. The comparison results with respect to time prediction are shown in FIGS. 3A-3F and the results with respect to mark prediction are presented in FIGS. 4A-4F.

The following observations can be made from these figures:

-   In nearly all cases, the proposed training framework outperformed its variants and the state-of-the-art approaches for both prediction tasks as measured by the four metrics. A one-tailed t-test was conducted to compare the performance of the framework with that of the other methods. The t-test results indicate that the framework is significantly better than the baselines at a significance level of 0.05.
-   Benefiting from the adaptive noise generation, the framework performed better than NCE-P and NCE-G, as the noise samples generated by INITIATOR force the MTPP model to capture more from the data distribution.
-   The framework outperformed MLE in most of the cases, as MLE specifies an integrable function as its ground CIF, which limits the functional space MLE can search.
-   Results show that the framework is better than MCMLE. This is because Monte Carlo sampling can lead to biased estimations of the likelihood function, while Theorem 1 shows that INITIATOR estimates the likelihood in a more principled way.
-   For time prediction over these datasets, RMSE is much larger than MAE, as the values of σ_(t) for the datasets are large relative to the values of μ_(t) and RMSE penalizes large errors much more than MAE does.

It should be understood from the foregoing that, while particular embodiments have been illustrated and described, various modifications can be made thereto without departing from the spirit and scope of the invention as will be apparent to those skilled in the art. Such changes and modifications are within the scope and teachings of this invention as defined in the claims appended hereto. 

What is claimed is:
 1. A method of improving the training of marked temporal point process models, comprising: utilizing a processor in communication with a tangible storage medium storing instructions that are executed by the processor to perform operations comprising: accessing a model that expresses sequential events using a marked temporal point process, the model configured to formulate a prediction of a future event and associated future event characteristics based on sequential event data applied to the model; training the model, by: identifying event samples P^(d) and initial noise samples P^(n) from a data distribution associated with the model, generating additional noise samples from the data distribution, and utilizing the additional noise samples to re-parameterize the model to adaptively push noise distribution associated with P^(n) towards P^(d) as the model accesses more information from the data distribution.
 2. The method of claim 1, further comprising: generating the additional noise samples, by: computing values for a predicted time and a predicted mark by using as inputs at least a history sequence ℋ_(i) and a parametric model p_(θ); and generating, iteratively, a predetermined number of times a noise sample inputting at least the predicted time and predicted mark x, and utilizing a probability density function inputting at least a value of variance and a value of mean which increase with respect to a number of iterations.
 3. The method of claim 2, wherein the probability density function outputs an adaptive Gaussian noise distribution.
 4. The method of claim 2, wherein a neural deep learning model maps an observed history of the history sequence ℋ_(i) to a vector representation h_(i).
 5. The method of claim 1, wherein the event samples P^(d) are continuous joint distributions of a time value t and predicted mark x.
 6. The method of claim 1, further comprising generating at least one noise event for an observed event.
 7. The method of claim 1, wherein a plurality of dense layers are applied to project a raw input into a multi-dimensional space.
 8. A processor, configured to: access a model that expresses sequential events using a marked temporal point process, the model configured to formulate a prediction of a future event and associated future event characteristics based on sequential event data applied to the model; train the model using adaptive noise sample generation, by: identifying event samples P^(d) and initial noise samples P^(n) from a data distribution associated with the model, generating additional noise samples from the data distribution, and utilizing the additional noise samples to re-parameterize the model to adaptively push noise distribution associated with P^(n) towards P^(d) as the model accesses more information from the data distribution.
 9. The processor of claim 8 further configured to generate the additional noise samples by: computing values for a predicted time and a predicted mark by using as inputs at least a history sequence ℋ_(i) and a parametric model p_(θ); and generating, iteratively, a predetermined number of times a noise sample inputting at least the predicted time and predicted mark x, and utilizing a probability density function inputting at least a value of variance and a value of mean which increase with respect to a number of iterations.
 10. The processor of claim 9, wherein the probability density function outputs an adaptive Gaussian noise distribution.
 11. The processor of claim 9, further configured to employ a neural deep learning model that maps an observed history of the history sequence ℋ_(i) to a vector representation h_(i).
 12. The processor of claim 8, wherein the event samples P^(d) are continuous joint distributions of a time value t and predicted mark x.
 13. The processor of claim 8, further configured to generate at least one noise event for an observed event.
 14. The processor of claim 8, further configured to apply a plurality of dense layers to project a raw input into a multi-dimensional space. 